Studies for Segmentation of Historical Texts: Sentences or Chunks?

Author

  • Florian Petran
Abstract

We present experiments on text segmentation for German, aimed at developing a method for segmenting historical texts. Since such texts have no (consistent) punctuation, we use a machine learning approach that labels tokens with their relative positions in text segments using Conditional Random Fields. We compare the performance of this approach on the tasks of segmenting text into sentences, clauses, and chunks, and find that the task gets easier the finer-grained the target segments are.
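The labeling scheme described in the abstract can be sketched as follows: each token is tagged with its relative position within its segment, and a sequence model such as a CRF is then trained to predict these tags from unsegmented token streams. This is a minimal sketch of the training-label construction only; the BIES-style tag names are illustrative assumptions, not necessarily the paper's exact tagset.

```python
def position_labels(segments):
    """Map a list of token segments to per-token position labels.

    B = segment-initial token, I = segment-internal token,
    E = segment-final token, S = single-token segment.
    """
    labels = []
    for seg in segments:
        if len(seg) == 1:
            labels.append("S")                    # one-token segment
        else:
            labels.append("B")                    # first token
            labels.extend("I" * (len(seg) - 2))   # middle tokens, if any
            labels.append("E")                    # last token
    return labels


# Two clauses from an (invented) unpunctuated historical text:
segments = [["ich", "sag", "dir"], ["gang", "hin"]]
print(position_labels(segments))  # ['B', 'I', 'E', 'B', 'E']
```

Flattening the tokens and pairing them with these labels yields exactly the kind of supervised sequence-labeling data a CRF toolkit expects; segment boundaries are then recovered at prediction time wherever an E or S tag occurs.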


Similar resources

SemEval-2010 Task 11: Event Detection in Chinese News Sentences

The goal of the task is to detect and analyze the event contents in real world Chinese news texts. It consists of finding key verbs or verb phrases to describe these events in the Chinese sentences after word segmentation and part-of-speech tagging, selecting suitable situation descriptions for them, and anchoring different situation arguments with suitable syntactic chunks in the sentence. Thr...


Attention and L2 learners' segmentation of complex sentences

The main objective of the current study is to investigate L2 Japanese learners’ ability to segment complex sentences from aural input. Elementary- and early intermediate-level L2 learners in general have not developed the ability to use syntactic cues to interpret the meaning of sentences they hear. In the case of Japanese, recognition of inflectional morphemes is crucial for the accurate segment...


Automatic generation of large-scale paraphrases

Research on paraphrase has mostly focussed on lexical or syntactic variation within individual sentences. Our concern is with larger-scale paraphrases, from multiple sentences or paragraphs to entire documents. In this paper we address the problem of generating paraphrases of large chunks of texts. We ground our discussion through a worked example of extending an existing NLG system to accept a...


Key Lexical Chunks in Applied Linguistics Article Abstracts

In any discourse domain, certain chunks are particularly frequent and deserve attention by the novice to be initiated and by the expert to maintain a sense of community. To make a relevant contribution to the awareness about applied linguistics texts and discourse, this study attempted to develop lists of lexical chunks frequently used in the abstracts of applied linguistics journals. The abstr...


Text Segmentation Criteria for Statistical Machine Translation

For several reasons machine translation systems are today unsuited to process long texts in one shot. In particular, in statistical machine translation, heuristic search algorithms are employed whose level of approximation depends on the length of the input. Moreover, processing time can be a bottleneck with long sentences, whereas multiple text chunks can be quickly processed in parallel. Henc...



Journal title:

Volume   Issue

Pages  -

Publication year: 2012